
Given the high house prices across the neighborhoods and boroughs of New York, a careful analysis of current prices, together with a prediction of future prices, is important when deciding which house to buy. For this scenario, we will assume the investor is interested in buying residential homes in New York. Specifically, I am interested in analyzing house prices in each neighborhood, the number of houses bought and sold, and location data, and in applying an unsupervised clustering analysis as well as a regression to predict future house values for the prospective neighborhoods. Machine learning tools such as unsupervised clustering and statistical regression are used to identify similar neighborhoods and to analyze current and future property prices.
I will use the following tools and datasets to perform my analysis.
The following libraries and APIs are used to analyze and visualize the data.
Data analysis tools are chosen to read and work with DataFrames effectively, perform mathematical analysis and create machine learning models. K-means clustering is used to group neighborhoods and possibly identify comparable investments across different areas, which is useful for finding alternatives once a viable investment has been found in one area. Statistical regression is used to predict future neighborhood house prices, identify trends and give information about the future value of the investment.
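To make the regression idea concrete, here is a minimal sketch of extrapolating a price trend for a single neighborhood. It uses an ordinary least-squares line fitted with `numpy.polyfit` as a stand-in, since the exact regression model is not specified at this point. The yearly prices and the `predict_price` helper are hypothetical and only illustrate the mechanics.

```python
import numpy as np

# Hypothetical yearly average sale prices (USD) for one neighborhood; not real data.
years = np.array([2010, 2011, 2012, 2013, 2014], dtype=float)
avg_price = np.array([500_000, 520_000, 540_000, 560_000, 580_000], dtype=float)

# Fit price = slope * (year - year_offset) + intercept by ordinary least squares.
# Centering the years keeps the fit numerically well conditioned.
year_offset = years.mean()
slope, intercept = np.polyfit(years - year_offset, avg_price, deg=1)

def predict_price(year):
    """Extrapolate the fitted line to a given year (hypothetical helper)."""
    return slope * (year - year_offset) + intercept

forecast_2020 = predict_price(2020)
print(f"Yearly change: {slope:,.0f} USD, 2020 forecast: {forecast_2020:,.0f} USD")
```

A straight line is the simplest possible choice. Extrapolating several years beyond the observed range should be read as a rough indication only.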
For visualization, Folium is used for geospatial data analysis and the matplotlib and seaborn libraries are used for graphical analysis of numerical and categorical data.
Lastly, the Foursquare and NYC OpenData APIs are queried to obtain surrounding venue data and geospatial data, respectively.
First, let's analyze the House Sales by Neighborhood for different boroughs in New York. This information comes from the NYC OpenData API, containing information for house sales between 2010 and 2019.
The data set contains 9 columns, listed below as column name, data type and description:
BOROUGH (Plain Text): Borough in which the properties are located.
NEIGHBORHOOD (Plain Text): Department of Finance determines the neighborhood name in the course of valuing properties. The common name of the neighborhood is generally the same as the name Finance designates. However, there may be slight differences in neighborhood boundary lines.
TYPE OF HOME (Plain Text): Type of home sold (one-, two- or three-family homes).
NUMBER OF SALES (Number): Total number of sales for that particular neighborhood.
LOWEST SALE PRICE (Number): Lowest sale price for that particular neighborhood.
AVERAGE SALE PRICE (Number): Average sale price for that particular neighborhood.
MEDIAN SALE PRICE (Number): Median sale price for that particular neighborhood.
HIGHEST SALE PRICE (Number): Highest sale price for that particular neighborhood.
YEAR (Plain Text): Year of the summary report.
We can see the data has 5979 rows, containing the lowest, average, median and highest sale price per year in each neighborhood. The data also distinguishes between types of home: one-family, two-family and three-family homes.
Let's confirm the data types for each column.
Since all the columns contain object variables, let's transform them to more appropriate data types (as described in the column descriptions above).
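A minimal sketch of this conversion step is shown below, assuming the numeric columns arrive as text with thousands separators (the actual raw format may differ). The rows are hypothetical. Converting YEAR to an integer is also an assumption on my part, made so that it can be used in time correlations later.

```python
import pandas as pd

# Hypothetical rows mimicking the raw table, with every value read in as text.
df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "QUEENS"],
    "NUMBER OF SALES": ["12", "85"],
    "AVERAGE SALE PRICE": ["5,250,000", "612,400"],
    "YEAR": ["2010", "2019"],
})
print(df.dtypes)  # inspect the types before converting

numeric_cols = ["NUMBER OF SALES", "AVERAGE SALE PRICE"]
for col in numeric_cols:
    # Remove thousands separators (assumed format) before converting to numbers.
    cleaned = df[col].str.replace(",", "", regex=False)
    # errors="coerce" turns unparseable values into NaN instead of raising.
    df[col] = pd.to_numeric(cleaned, errors="coerce")

df["YEAR"] = df["YEAR"].astype(int)               # assumption: year as integer
df["BOROUGH"] = df["BOROUGH"].astype("category")  # few distinct labels

print(df.dtypes)  # confirm the conversion
```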
Now all the data types are correctly converted.
Now that we have all the data correctly transformed, let's perform some exploratory descriptive analysis of the data set. Our objective is to find any initial patterns and gain insights on the data distribution, characteristics or trends for each neighborhood, borough or type of home as well as yearly trends.
Looking at the "type of home" column, it stands out that there are 9 categories, which is too many. Let's look at the distribution of the categorical variables.
Clearly, "ONE/TWO/THREE FAMILY HOMES" are wrongly labeled with an extra space at the beginning in a couple of rows. Let's fix this.
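The sketch below shows one way to spot and repair this kind of labeling problem with `str.strip`. The labels are hypothetical examples that imitate the issue, not the actual rows of the data set.

```python
import pandas as pd

# Hypothetical labels, some with a stray leading space.
home_types = pd.Series(
    [
        "ONE FAMILY HOMES",
        " ONE FAMILY HOMES",
        "TWO FAMILY HOMES",
        " TWO FAMILY HOMES",
        "THREE FAMILY HOMES",
    ],
    name="TYPE OF HOME",
)

# Before cleaning, the padded labels count as separate categories.
labels_before = home_types.nunique()
print(home_types.value_counts())

# Remove leading and trailing whitespace so equal labels collapse together.
cleaned = home_types.str.strip()
labels_after = cleaned.nunique()
print(cleaned.value_counts())
```

Printing the labels with `repr` (for example `[repr(v) for v in home_types.unique()]`) makes the stray spaces easier to see.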
From this, we can observe that we have information from 5 different boroughs, 249 unique neighborhoods and 6 types of home. Additionally, we can see that the average sale price across all homes is 1,207,115 USD, with an average of 35.5 house sales across all years.
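Summary figures like these can be obtained with `nunique` and `describe`. The sketch below uses a small hypothetical table, so its numbers do not match the ones reported above.

```python
import pandas as pd

# Hypothetical records; values are made up for illustration.
df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "QUEENS", "QUEENS", "BROOKLYN"],
    "NEIGHBORHOOD": ["HARLEM", "ASTORIA", "FLUSHING", "BUSHWICK"],
    "TYPE OF HOME": [
        "ONE FAMILY HOMES",
        "TWO FAMILY HOMES",
        "ONE FAMILY HOMES",
        "THREE FAMILY HOMES",
    ],
    "NUMBER OF SALES": [4, 30, 50, 36],
    "AVERAGE SALE PRICE": [2_000_000, 800_000, 700_000, 900_000],
})

# Number of distinct values in each categorical column.
unique_counts = df[["BOROUGH", "NEIGHBORHOOD", "TYPE OF HOME"]].nunique()
print(unique_counts)

# Count, mean, standard deviation and quartiles of the numeric columns.
numeric_summary = df[["NUMBER OF SALES", "AVERAGE SALE PRICE"]].describe()
mean_sales = numeric_summary.loc["mean", "NUMBER OF SALES"]
mean_price = numeric_summary.loc["mean", "AVERAGE SALE PRICE"]
print(numeric_summary)
```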
Let's look at the data visually and try to get more insights on each borough, neighborhood and type of home.
As we can see, Manhattan has significantly fewer house sales than the other boroughs, while Queens and Brooklyn have the most. This runs inversely to the average price per sale, suggesting that high prices may be one driver of the low number of sales.
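The comparison behind this observation can be sketched with a `groupby` aggregation per borough. The figures below are hypothetical and chosen only to mirror the pattern described. Note that the mean price here is an unweighted mean of neighborhood averages, which is an assumption about how the comparison is made.

```python
import pandas as pd

# Hypothetical records; not the actual data set.
df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN", "MANHATTAN", "QUEENS", "QUEENS", "BROOKLYN"],
    "NUMBER OF SALES": [3, 5, 60, 40, 70],
    "AVERAGE SALE PRICE": [6_000_000, 4_000_000, 600_000, 800_000, 900_000],
})

# Total sales and (unweighted) mean of the average sale price per borough.
by_borough = df.groupby("BOROUGH").agg(
    total_sales=("NUMBER OF SALES", "sum"),
    mean_price=("AVERAGE SALE PRICE", "mean"),
)
by_borough = by_borough.sort_values("total_sales", ascending=False)
print(by_borough)

# A negative correlation indicates that pricier boroughs have fewer sales.
sales_price_corr = by_borough["total_sales"].corr(by_borough["mean_price"])
print(sales_price_corr)
```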
Additionally, it is clear that house prices and sales have been on an increasing trend since 2010, as indicated both visually and through correlations. A time-period analysis of prices and sales seems relevant.
Looking at Manhattan, there is a strong positive correlation between sale price and year; however, the number of sales is not correlated with time.
Let's perform a similar analysis, this time considering the different home type subcategories.
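Correlations of this kind can be computed per borough as in the sketch below, which uses the Pearson correlation from `Series.corr`. The records are hypothetical and were chosen so that the Manhattan rows show the pattern described. YEAR is assumed to be stored as a number.

```python
import pandas as pd

# Hypothetical yearly records for two boroughs; not the actual data set.
df = pd.DataFrame({
    "BOROUGH": ["MANHATTAN"] * 4 + ["QUEENS"] * 4,
    "YEAR": [2010, 2011, 2012, 2013] * 2,
    "AVERAGE SALE PRICE": [4.0e6, 4.5e6, 5.1e6, 5.6e6,
                           5.0e5, 5.2e5, 5.1e5, 5.5e5],
    "NUMBER OF SALES": [10, 8, 11, 9, 50, 55, 61, 66],
})

# Pearson correlation of price and of sales against year, per borough.
rows = {}
for borough, group in df.groupby("BOROUGH"):
    rows[borough] = {
        "price_vs_year": group["YEAR"].corr(group["AVERAGE SALE PRICE"]),
        "sales_vs_year": group["YEAR"].corr(group["NUMBER OF SALES"]),
    }
corr_by_borough = pd.DataFrame.from_dict(rows, orient="index")
print(corr_by_borough.round(2))
```

With only a handful of yearly points per borough, these coefficients are sensitive to single years and are best read alongside the plots.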